Hive - A Warehousing Solution Over a Map-Reduce Framework
نویسندگان
چکیده
The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language HiveQL, which are compiled into map-reduce jobs executed on Hadoop. In addition, HiveQL supports custom map-reduce scripts to be plugged into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog, Hive-Metastore, containing schemas and statistics, which is useful in data exploration and query optimization. In Facebook, the Hive warehouse contains several thousand tables with over 700 terabytes of data and is being used extensively for both reporting and ad-hoc analyses by more than 100 users. The rest of the paper is organized as follows. Section 2 describes the Hive data model and the HiveQL language with an example. Section 3 describes the Hive system architecture and an overview of the query life cycle. Section 4 provides a walk-through of the demonstration. We conclude with future work in Section 5.
منابع مشابه
Massive Multi-Omics Microbiome Database (M3DB): A Scalable Data Warehouse and Analytics Platform for Microbiome Datasets
Massive Multi-Omics Microbiome Database (MDB) is a data warehousing and analytics solution designed to handle diverse, complex, and unprecedented volumes of sequence and taxonomic classification data obtained in a typical microbiome project using NGS technologies. MDB is a platform developed on Apache Hadoop, Apache Hive and PostgreSQL technologies. It enables users to store, analyze and manage...
متن کاملA Context-Based Performance Enhancement Algorithm for Columnar Storage in MapReduce with Hive
To achieve high reliability and scalability, most large-scale data warehouse systems have adopted the clusterbased architecture. In this context, MapReduce has emerged as a promising architecture for large scale data warehousing and data analytics on commodity clusters. The MapReduce framework offers several lucrative features such as high fault-tolerance, scalability and use of a variety of ha...
متن کاملAdaptive Prejoin Approach for Performance Optimization in MapReduce-based Warehouses
MapReduce-based warehousing solutions (e.g. Hive) for big data analytics with the capabilities of storing and analyzing high volume of both structured and unstructured data in a scalable file system have emerged recently. Their efficient data loading features enable a so-called near real-time warehousing solution in contrast to those offered by conventional data warehouses with complex, long-ru...
متن کاملARPN Journal of Science and Technology::Analysis of Movie Lens Data Set using Hive
Large scale data set provides the better opportunity to find out much better data relationship in the area of business intelligence. In the paper, we implement our systems using Hadoop that has been popular to store and compute Big Data. However, it is not easy to write Hadoop Map Reduce code. Therefore, we use Hive and Hive QL codes to understand the relationships between ratings and the users...
متن کاملA Framework for Semi-Automated Implementation of Multidimensional Data Models
Data warehousing solution development represents a challenging task which requires the employment of considerable resources on behalf of enterprises and sustained commitment from the stakeholders. Costs derive mostly from the amount of time invested in the design and physical implementation of these large projects, time that we consider, may be decreased through the automation of several proces...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- PVLDB
دوره 2 شماره
صفحات -
تاریخ انتشار 2009